Dispersal history and bidirectional human-fish host switching of invasive, hypervirulent Streptococcus agalactiae sequence type 283

Human group B Streptococcus (GBS) infections attributable to an invasive, hypervirulent sequence type (ST) 283 have been associated with freshwater fish consumption in Asia. The origin, geographic dispersion pathways and host transitions of GBS ST283 remain unresolved. We gather 328 ST283 isolate whole-genome sequences collected from humans and fish between 1998 and 2021, representing eleven countries across four continents. We apply Bayesian phylogeographic analyses to reconstruct the dispersal history of ST283 and combine ST283 phylogenies with genetic markers and host association to investigate host switching and the gain and loss of antimicrobial resistance and virulence factor genes. Initial dispersal within Asia followed ST283 emergence in the early 1980s, with Singapore, Thailand and Hong Kong observed as early transmission hubs. Subsequent intercontinental dispersal originating from Vietnam began in the decade commencing 2001, demonstrating ST283 holds potential to expand geographically. Furthermore, we observe bidirectional host switching, with the detection of more frequent human-to-fish than fish-to-human transitions, suggesting that sound wastewater management, hygiene and sanitation may help to interrupt chains of transmission between hosts. We also show that antimicrobial resistance and virulence factor genes were lost more frequently than gained across the evolutionary history of ST283. Our findings highlight the need for enhanced surveillance, clinical awareness, and targeted risk mitigation to limit transmission and reduce the impact of an emerging pathogen associated with a high-growth aquaculture industry.


Methods in S1 Text
Whole genome sequence analysis.
A total of 317 sequences were obtained as assemblies, and 11 as whole-genome sequencing reads.Quality of whole-genome sequence data was assessed using FastQC v.0.11.9 [1].Wholegenome sequence paired-end reads were assembled using SPAdes v.3.15.2 [2] with the "-careful" option.Assemblies were mapped to reference genome SG-M1 (accession CP012419.2) [3] and aligned using SKA v.1.0[4].Putative regions of recombination were identified on the resulting 2.12 Mb whole genome alignment using Gubbins [5].Putative mobile genetic elements (MGE) were predicted and masked from the alignment, after which regions of recombination identified using Gubbins were purged, yielding a MGE-free and putatively non-recombinant alignment.Single nucleotide polymorphisms (SNP) were called on this alignment using SNPsites [6].

Prediction of mobile genetic elements.
The following mobile genetic elements were predicted in the SG-M1 genome: phages/prophages, CRISPR arrays, and insertion sequences/transposons (Table A in S1 Text).
Phage/prophages were predicted in the SG-M1 reference genome (accession CP012419.2) using the web version of PHASTER (https://phaster.ca/) and were used as reported.CRISPR arrays were predicted using CRT v1.2 [7], CRISPRDetect v2.4 [8], and PILER-CR v1.06 [9].In all cases, the SG-M1 genome sequence was used as input with default parameters selected.The three sets of results were manually inspected and found to differ by only one divergent repeat sequence in each of the two predicted arrays.We included the coordinates for the divergent repeat in our final definition of the extent of both arrays.Insertion sequences/transposons were predicted using the ISEFinder database (with the BLASTN tool at https://isfinder.biotoul.fr/blast.phpon March 20, 2022) and digIS (cloned from GitHub on March 20, 2022) [10].The results were manually reconciled yielding a total of 16 predictions, of which 10 predictions were nearly identical using both methods (start and end within 4 bp).For three predictions, one of the programs predicted a sequence that was completely contained in a prediction from the other program; we took the larger prediction in these cases.Two predictions were unique to digIS, and both were ORF predictions (as opposed to complete IS predictions); we accepted these to prevent potentially spurious SNPs from other IS/Tn-related genes that might align to these individual ORFs.Finally, one prediction consisted of an IS predicted identically by both programs, but digIS further predicted a partial ORF that extended beyond one border of the identically predicted IS; we took the union of these coordinates as the final prediction to guard against potentially spurious SNPs.

Prediction of antimicrobial resistance and virulence factor genes.
Sequences were screened for antimicrobial resistance and virulence factor genes with Abricate [11] using the NCBI, Resfinder, ARG-ANNOT and VFDB databases (with database updates available as of 29 March 2022).AMRFinderPlus [12] was used to further screen sequences for both acquired antimicrobial resistance genes and point mutations using the Streptococcus agalactiae tag.Results were manually reconciled and genes identified in any of the databases with a minimum coverage match of 80% to the query sequence length were classified as "present"; gene coverage match below this threshold was classified as "absent."

Phylogenetic analysis.
A maximum likelihood (ML) phylogenetic tree was inferred from the 1,214 SNP alignment using RAxML v8.2.12 [13] with a general time reversible (GTR) and gamma model of rate heterogeneity.We performed 100 bootstrap replications to assess internal node support.A rootto-tip regression was performed based on the ML tree to evaluate the temporal signal inherent to our dataset.For this purpose, we used the R package "BactDating" [14] and observed a relatively weak yet supported correlation between date of sampling and genetic divergence (R 2 = 0.22; p < 0.0001; Fig A in S1 Text).Trees were visualized and annotated using the R package "ggtree" v3.2.1 [15] and the program FigTree v1.4.4 (http://tree.bio.ed.ac.uk/software/figtree/).

Spatio-temporal and discrete trait analyses.
To circumvent convergence issues with a joint inference approach and reduce computational burden, we used a multi-step process to generate spatial reconstructions of GBS ST283 transitions and to infer the gain and loss of antimicrobial resistance and virulence factor genes.In the first step we inferred a time-scaled phylogenetic tree from the ML tree using Markov chain Monte Carlo (MCMC) simulations over 10 million generations in the R package "BactDating," with convergence confirmed through visual inspection of traces and effective sample sizes (ESS) ranging between 204 and 501.In the second step, we used this time-scaled phylogenetic tree as a fixed tree topology to perform both a discrete phylogeographic reconstruction and to analyze gene transition rates using the discrete diffusion model [16] implemented in the software package BEAST v1.10 [17].
Specifically, a Bayesian stochastic search variable selection (BSSVS) approach was used to identify state transitions for discrete traits.Ancestral trait reconstructions were performed for location, antimicrobial resistance genes, and virulence factor genes.The MCMC algorithm was run for two billion iterations, sampling every 10 5 iterations.MCMC convergence and mixing, as well as ESS values associated with estimated parameters were checked using Tracer v1.7 [18].
All ESS values were >200.The maximum clade credibility (MCC) tree was identified and annotated with the program TreeAnnotator v1.10 [17] among 1,000 trees sampled from the postburn-in posterior distribution.
In the discrete phylogeographic reconstruction, country transitions were estimated as Markov jump counts in the BSSVS analysis.Markov jumps were reported along with both standard Bayes factor (BF) support values (ratio of posterior over prior odds and interpreted as a measure of the strength of evidence for the alternate hypothesis) and an adjusted Bayes factor (BFadj) [19].
As disparity in sample sizes could influence transitions between states, the BFadj value accounts for relative abundance in sample sizes for discrete traits by generating a prior with randomized traits over tree tips (tip swapping) in each MCMC iteration in a manner that approximates the tip-date randomization test for temporal signal [20].Standard BF and BFadj values >20 were considered strong statistical support [21].
To account for heterogeneous sampling orientation amongst host type in the dataset, we inferred host switching by working on downsampled time-scaled phylogenetic trees obtained by randomly sampling equal numbers of isolates originating from human and fish hosts in the tree tips.We generated 1,000 downsampled phylogenies, performing in each tree a maximum likelihood ancestral host estimation using an equal rates model implemented in the R package "ape" and counting host state transitions (human-to-fish and fish-to-human).The transition counts in each tree were taken as the distribution from which the median and interquartile range (IQR) were calculated.
As the dataset was over-represented by human origin isolates in the earliest years, we conducted three additional analyses to assess the impact of sampling date on host switching (Fig 3).In the first analysis, we worked with a downsampled dataset consisting of equal numbers of isolates originating from human and fish hosts using only isolates within a window of years where both human and fish host isolates were represented.In the second analysis, we worked with the downsampled dataset from the first analysis, but further time-matched isolates by year and host.
Specifically, an equal number of isolates originating from human and fish hosts per year were retained, where that number was defined as the minimum number of isolates available for either one of the two host types in that year.Finally, the third analysis followed the procedure for the second analysis but was based on an original tree that was first subsampled by phylogenetic clustering to account for the uneven phylogenetic diversity associated with each host typenamely, sequences collected in fish having here a greater tendency to cluster within monophyletic clades.Specifically, subsampling by phylogenetic clustering consisted of removing sequences such that monophyletic clusters of sequences collected from the same host type are only represented by a single sequence [22].In the context of the ancestral reconstruction of the host trait, such clusters would largely represent dispersal within the same type of host.Therefore, performing this preliminary subsampling step prior to the subsampling by year and by host prevents giving too much weight to monophyletic clades gathering several sequences sampled from the same year and collected within the same type of host.All resulting downsampled datasets were then subjected to ancestral host reconstructions across 1,000 replicates using the method described above.

Across-trait host-gene and gene-gene correlations.
We apply a recently developed phylogenetic multivariate probit model [23,24], which can efficiently learn the correlation between discrete traits while adjusting for across-taxa covariation inherent to the phylogenetic tree.This model assumes continuous latent variables underlying discrete traits where the latent variables follow a Brownian diffusion along the tree.The acrosstrait partial correlations describe conditional dependencies between any two traits without confounding from other considered traits [23].As compared with empirical correlations (Fig E in S1 Text), which do not account for the inherent covariation amongst taxa in the phylogenetic tree, these partial correlations provide enhanced insights into potential molecular mechanisms without the confounding influence of the shared evolutionary history between lineages.We performed Bayesian inference of the phylogenetic probit model [23] implemented in BEAST v1. 10 [17] and reported the pairwise partial correlations with a posterior median > 0.2 or < -0.2 (Fig 5 Posterior probabilities for ancestral root node location are displayed for each country using both standard and tip swapping analyses (see Methods for further detail).Comparatively high posterior probability assigned to Singapore at the root node of the phylogeny using both standard and tip swapping analyses indicates that higher sampling effort in Singapore is associated with this result, and no conclusion can be drawn regarding the location origin of ST283 from the currently available dataset.

Fig
Fig A -E in S1 Text

Fig A .
Fig A.Root-to-tip linear regression analysis based on the maximum likelihood (ML) phylogenetic tree (left), which demonstrates a relatively weak yet supported correlation (R 2 = 0.22, p < 0.0001) between date of sampling and genetic divergence (right).

Fig
Fig B. Node age confidence intervals and bootstrap node support for GBS ST283.(a) Inferred node age 95% highest posterior density confidence intervals are displayed as blue bars in the time-scaled phylogenetic tree.(b) Unrooted maximum likelihood phylogenetic tree inferred from the 1214 SNP alignment (GTR gamma, 100 bootstrap replications using RAxML v8.2.12) showing bootstrap node support values.

Fig D .
Fig D. Discrete phylogeographic analysis of the dispersal history of GBS ST283between 1981 and 2021.Intracontinental and intercontinental transition events are inferred as Markov jumps.Maps display transition events by decade and are accompanied by circular migration flow plots, in which transitions out of a country are represented by arrows originating at the outer ring and ending in an arrowhead offset from the destination country.Arrow width is proportional to the magnitude of the Markov jumps.Only transition events associated with standard (non-adjusted) Bayes factor support > 20 are displayed, a threshold value corresponding to strong statistical support (see the Methods section for further detail).

Table A. Mobile genetic elements (CRISPR, phage, IS/Transposons) predicted in SG-M1. 34
). Predicted mobile genetic elements were masked from the whole genome alignment.